Improvements in Handwritten and Printed Text Separation in Historical Archival Documents
Abstract
The presence of handwritten text and annotations combined with typewritten and machine-printed text in historical archival records makes them visually complex, posing challenges for OCR systems in accurately transcribing their content. This paper is an extension of [1], reporting on improvements in the separation of handwritten text from machine-printed text (including typewriters), by the use of FCN-based models trained on datasets created from different data synthesis pipelines. Results show a significant increase of about 20% in the intrinsic evaluation on artificial test sets, and an 8% improvement in the extrinsic evaluation on a subsequent task on real documents.
Similar resources
Discrimination between Printed and Handwritten Text in Documents
Recognition techniques for printed and handwritten text in scanned documents are significantly different. In this paper, we propose a method to automatically identify the signature in scanned document images. This helps to retrieve the document images based on the signature. A simple region growing algorithm is used to segment the document into a number of patches. A patch is composed of many...
Text-image alignment for historical handwritten documents
We describe our work on text-image alignment in the context of building a historical document retrieval system. We aim at aligning images of words in handwritten lines with their text transcriptions. The images of handwritten lines are automatically segmented from the scanned pages of historical documents and then manually transcribed. To train automatic routines to detect words in an image of hand...
Handwritten Text Recognition for Historical Documents
The number of digitized legacy documents has been rising dramatically over the last years, due mainly to the increasing number of on-line digital libraries publishing this kind of document. The vast majority of them remain waiting to be transcribed into a textual electronic format (such as ASCII or PDF) that would provide historians and other researchers new ways of indexing, consulting and que...
Handwritten and Printed Text Separation in Real Document
The aim of the paper is to separate handwritten and printed text from a real document embedded with noise, graphics including annotations. Relying on run-length smoothing algorithm (RLSA), the extracted pseudolines and pseudo-words are used as basic blocks for classification. To handle this, a multi-class support vector machine (SVM) with Gaussian kernel performs a first labelling of each pseud...
Text Extraction from Historical Handwritten Documents by Edge Detection
Many national archives or libraries keep large amounts of historical handwritten documents. One problem that many archivists are facing is the seeping of ink through the pages of certain double-sided handwritten documents after long periods of storage. The result is that the handwritten characters from the reverse side appear as noise on the front side and even interfere with the front side char...
Journal
Journal title: Archiving
Year: 2023
ISSN: 2161-8798, 2168-3204
DOI: https://doi.org/10.2352/issn.2168-3204.2023.20.1.7